Tidy data in Excel

Brendan Clarke, NHS Education for Scotland, brendan.clarke2@nhs.scot

28/06/2024

Welcome

  • this session is for 🌶 Excel beginners
  • we’ll get going properly at 9.35
  • this is a mainly-practical session, and you’ll need Excel of some sort to follow along
  • if you can’t access the chat, you might need to join our Teams channel: tinyurl.com/kindnetwork

The KIND network

  • a social learning space for staff working with knowledge, information, and data across health, social care, and housing in Scotland
  • we offer social support, free training, mentoring, community events, …
  • Teams channel / mailing list

Where does this fit in?

KIND Excel beginner skill tree

  • for this session, you’ll need to be familiar with the Excel basics (getting around in Excel, opening/saving/closing files, and a little bit of A1 referencing)
  • we’re going to dodge formatting and formulas as much as possible today

Session outline

  • understanding tidy data
  • a word of warning
  • a practical introduction to making tidy data:
    • values
    • tables
    • autofill
  • exercises and demos

Understanding tidy data

R4DS Figure 5.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells, via CC BY-NC-ND 3.0 US

  • tidy data is not specific to Excel (e.g. this session is partly adapted from chapter 5 of R for Data Science)
  • if we tidy our data, we’ll be able to compute it
  • standards for tidying data help make for safe and effective analysis

A word of warning

  • tidying data can be very slow and complicated
  • in Excel, there are lots of advanced tools that can speed things up
    • PowerQuery especially
  • this is a beginner’s session, so we’ll avoid the more fancy tools
  • but if your process takes lots of manual work, it’s definitely worth exploring alternative ways of working

Values

  • values is the word we use to describe each bit of information in an Excel spreadsheet. Some examples:
    • a date, like 2024-06-28
    • a number, like 11.2
    • a name, like NHS Grampian
    • a cost, like £12.50
  • each value should have its own cell

Entering values

  • how would you enter this data: 2018, 2019, 2020, 2021, 2022?
  • please now:
    • open Excel
    • start a new workbook
    • add a column header year in cell A1
    • then add each of those five values in the five cells underneath (down to A6)

More values

  • we’re going to be using some birthrate data from the NRS for this session. We’ll start by adding some birth rate data

  • this is given as births per 1,000 women in five year age brackets. We’ll start with 25-29 year old mothers

  • please add the header 25-29yrs in cell B1

  • here are the values for our five years: 73.4, 71, 66.8, 69.6, 66.7

Tables

  • you should keep your new data in a table
  • Insert > Table
    Or you might prefer the Ctrl + T shortcut
  • tables allow you to sort and filter your data
  • they also act as a useful ‘container’ (or data structure) for your data

Extending tables

30-34yrs
90.9
88.6
83.4
85.9
84.4
  • please could you add some new data to your table showing births for another age group
  • you should be able to drag the blue table corner over your new column to extend the table
    blue table corner

Autofill

  • one last way of adding values: autofill
  • drag again to make a new empty column, and label it difference
  • in D2 (the first ‘proper’ cell), copy this formula: =C2-B2. This will calculate the difference in birth rates between the two columns
  • finally, click the small green corner of that newly-filled cell to autofill the column small green corner

Back to tidy data

  • we’ve now got some data with:
    • each value in a cell
    • each variable in a column
      • here, this is a maternal age bracket
    • each observation in a row
      • here, this is a year
  • we could work through and extend this data by hand, but we’ll now switch over to some supplied data to save all the typing
    • errors are common in manual data-entry
    • if you can import data, that’s usually better than re-typing it

Many values per cell

  • We often find useful data with more than one value per cell
  • this can be helpful for humans
    multiple values is good for humans, but bad for Excel
  • Excel can’t do anything with this data

Exercise one: many values per cell

  • find the Exercise one sheet in the exercise file
  • try calculating an average for each of the groups
  • or, if you’re more confident, try plotting the data

Nothing works!

Divide by zero errors in the averages
Plots produce pure nonsense

What’s the solution?

  • make this data have one value per cell
  • we could remove all the percentages by hand - but that’s going to be very slow
  • Text to Columns gives us a better tool for the job
    Text to Columns gives us a better tool for the job

Text to Columns

  • select a column
  • in the Data tab of the ribbon, you should find the Text to Columns tool
    Text to Columns tool
  • note that you can keep, or remove, the percentage column. We’ll skip it, to keep things simple
    keep, or remove, the percentage column

Try working with that data again

We can plot the data now

Exercise two: transposed data

  • sometimes you’ll find data where the columns and rows have been flipped
    data where the columns and rows have been flipped
  • that’s slow to fix by hand, but luckily you can transpose it, which swaps rows and columns
  • select your data, and copy/paste into a new cell
  • then use the transpose option
    transpose option

Exercise two: transposed data

  • you might need to fix formatting and labels: fix formatting and labels
  • the Clear formats option might help this Clear formats
  • if you run into trouble, please note that transposing only works on data that is not in a table

Demo one: validation and really messy data

  • one of the most time-consuming bits of tidying is checking your values
  • we’ll briefly demonstrate the data validation tool
    the data validation tool
  • this allows you to describe what format you think your data should be in, and then highlights anything that doesn’t fit

Demo one: validation and really messy data

  • we select the years column
  • then open the data validation tool
  • then set appropriate validation options, so Whole number between 2010 and 2030
    set appropriate validation options

Demo one: validation and really messy data

  • then select Circle Invalid Data
    select Circle Invalid Data
  • we can now go through and fix anything circled

Demo two: reshaped data

  • there are several ways of reshaping data that’s not in a tidy format
  • we’ll look at the manual way here, but - as it’s horrible - I’d be keen to encourage you to investigate Power Query or Pivot Tables to reshape if this is a regular part of your working day. PQ takes < 10 seconds, PT not much longer

Manual reshape

  • sort the data by year
    sort the data by year
  • then copy and paste blocks of data, making sure to keep the years aligned
    copy and paste blocks of data

Manual reshape

  • make sure you then copy the age brackets to label the column
    copy the age brackets to label
  • then delete the spare years columns, and the age brackets
    delete the spare years

Forthcoming Excel sessions

Session Date Area Level
Excel tables 10:00-10:30 Mon 1st July 2024 Excel 🌶 :beginner-level
Formulas in Excel 15:00-16:00 Wed 3rd July 2024 Excel 🌶 :beginner-level
Lambda formulas in Excel 13:00-13:30 Mon 15th July 2024 Excel 🌶🌶 : intermediate-level
Lookups in Excel 13:00-14:30 Thu 1st August 2024 Excel 🌶🌶 : intermediate-level
Relative, absolute, mixed, structured, and R1C1 references in Excel 15:00-16:00 Thu 8th August 2024 Excel 🌶🌶 : intermediate-level

Feedback

Feedback link

Please give us one minute of your time. We add feedback comments to our training pages, because we think this is the most useful resource for people looking for specific training that suits their needs